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Abstract 

We present a new computational approach to approximating a large, noisy data ta- 
ble by a low-rank matrix with sparse singular vectors. The approximation is obtained 
from thresholded subspace iterations that produce the singular vectors simultaneously, 
rather than successively as in competing proposals. We introduce novel ways to esti- 
mate thresholding parameters which obviate the need for computationally expensive 
cross-validation. We also introduce a way to sparsely initialize the algorithm for com- 
putational savings that allow our algorithm to outperform the vanilla SVD on the full 
data table when the signal is sparse. A comparison with two existing sparse SVD 
methods suggests that our algorithm is computationally always faster and statistically 
always at least comparable to the better of the two competing algorithms. 

Key Words: Cross-validation; Denoising; Low-rank matrix approximation; Penaliza- 
tion; Principal component analysis; Power iterations; Thresholding. 

1 Introduction 

Singular value decompositions (SVD) and principle component analyses (PCA) are the foun- 
dations for many applications of multivariate analysis. They can be used for dimension re- 
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duction, data visualization, data compression and information extraction by extracting the 
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multivariate methods have escalated as the dimensionality of data sets has grown rapidly in 
such fields as genomics, imaging, financial markets. A critical issue that has arisen in large 
datasets is that in very high dimensional settings classical SVD and PCA can have poor 
statistical properties ( Shabalin and Nobel||2010 Nadler 2009 Paul 2007 and Johnstone and 



Lu|2009[ ). The reason is that in such situations the noise can overwhelm the signal to such an 
extent that traditional estimates of SVD and PCA loadings are not even near the ballpark of 
the underlying truth and can therefore be entirely misleading. Compounding the problems 
in large datasets are the difficulties of computing numerically precise SVD or PCA solutions 
at affordable cost. Obtaining statistically viable estimates of eigenvectors and eigenspaces 
for PCA on high-dimensional data has been the focus of a considerable literature; a repre- 
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and Johnstone (2007), Shen and Huang (2008), Johnstone and Lu (2009), Shen et al. (2011) 



Ma (2011). On the other hand, overcoming similar problems for the classical SVD has been 



the subject of far less work, pertinent articles being Witten et al. (2009), Lee et al. (2010a), 



Huang et al. 


(2009 


) and 
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(2011 



In the high dimensional setting, statistical estimation is not possible without the assump- 
tion of strong structure in the data. This is the case for vector data under Gaussian sequence 



models (Johnstone, 2011), but even more so for matrix data which require assumptions such 
as low rank in addition to sparsity or smoothness. Of the latter two, sparsity has slightly 
greater generality because certain types of smoothness can be reduced to sparsity through 



suitable basis changes (Johnstone, 2011). By imposing sparseness on singular vectors, one 
may be able to "sharpen" the structure in data and thereby expose "checkerboard" patterns 
that convey biclustering structure, that is, joint clustering in the row- and column-domains 
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Tibshirani (2010) used sparsity to develop a novel form of hierarchical clustering. 

So far we implied rather than explained that SVD and PCA approaches are not identical. 
Their commonality is that both apply to data that have the form of a data matrix X = (xij) 
of size n X p. The main distinction is that the PCA model assumes the rows of X to 
be i.i.d. samples from a p-dimensional multivariate distribution, whereas the SVD model 
assumes the rows i = 1, 2, n to correspond to a "fixed effects" domain such as space, time, 
genes, age groups, cohorts, political entities, industry sectors, .... This domain is expected 
to have near-neighbor or grouping structure that will be reflected in the observations Xij 
in terms of smoothness or clustering as a function of the row domain. In practice, the 
applicability of either approach is often a point of debate (e.g., should a set of firms be 
treated as a random sample of a larger domain or do they constitute an enumeration of the 
domain of interest?), but in terms of practical results the analyses are often interchangeable 
because the points of difference between the SVD and PCA models are immaterial in the 
exploratory use of these techniques. The main difference between the models is that the 
SVD approach analyzes the matrix entries as structured low-rank means plus error, whereas 
the PCA approach analyzes the covariation between the column variables. 

In modern developments of PCA, interest is focused on "functional" data analysis situ- 



ations or on the analog of the "sequence model" (Johnstone, 2011) where the columns also 
correspond to a structured domain such as space, time, genes, ... . It is only with this focus 
that notions of smoothness and sparseness in the column or row domain are meaningful. 
A consequence of this focus is the assumption that all entries in the data matrix have the 
same measurement scale and unit, unlike classical PCA where the columns can correspond 
to arbitrary quantitative variables with any mix of units. With identical measurement scales 
throughout the data matrix, it is meaningful to entertain decompositions of the data into 
signal and fully exchangeable noise: 

X = E + Z, (1) 
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where S = {^ij) is an n x p matrix representing the signal and Z = (zij) is an n x p random 
matrix representing the noise and consisting of i.i.d. errors as its components. In both PCA 
and SVD approaches, the signal is assumed to have a multiplicative low-rank structure: 
S = UDV = Yll=i^i^i^'h where for identifiability it is assumed that rank r < min(r;,,p), 
usually even such as r = 1, 2 or 3. The difference between SVD and PCA is, using 
ANOVA language, that in the SVD approach both U and V represent fixed effects that can 
both be regularized with smoothness or sparsity assumptions, whereas in functional PCA U 
is a random effect. As indicated above, such regularization is necessary for large n and p 
because for realistic signal-to-noise ratios recovery of the true U and V may not be possible. 
— Operationally, estimation under sparsity is achieved through thresholding. In general, if 
both matrix dimensions are thresholded, one obtains sparse singular vectors of X; if only 
the second matrix dimension is thresholded, one obtains sparse eigenvectors of X'X, which 
amounts to sparse PCA. 

A few recent papers propose sparsity approaches to the high dimensional SVD problem: 



Witten et al. (2009) introduced a matrix decomposition which constrains the h norm of 



the singular vectors to impose sparsity on the solutions. Lee et al. (2010a) used penalized 
LS for rank-one matrix approximations with li norms of the singular vectors as additive 
penalties. Both methods use iterative procedures to solve different optimization problems. 



[We will give more details about these two methods in Section pi] Allen et al. (2011) is a 



Lagrangian version of Witten et al. ( 2009 ) where the errors are permitted to have a known 
type of dependence and/or heteroscedasticity. These articles focus on estimating the first 
rank-one term given by (ii, Ui, Vi by either constraining the li norm of Ui and Vi or adding it 
as a penalty. To estimate d2,U2,^2 for a second rank-one term, they subtract the first term 
diUiv'^ from the data matrix X and repeat the procedure on the residual matrix. There 



exists further related work on sparse matrix factorization, for example, by Zheng et al. 



(2007), Mairal et al. (2010) and Bach et al. (2008), but these do not have the form of a SVD. 



In our simulations and data examples we use the proposals by Witten et al. (2009) and Lee 



et al. (2010a) for comparison. 



Our approach is to estimate the subspaces spanned by the leading singular vectors si- 
multaneously. As a result, our method yields sparse singular vectors that are orthogonal, 



unlike the proposals by Witten et al. (2009) and Lee et al. (2010a). In terms of statistical 
performance, simulations show that our method is competitive with the better performing 



of the two proposals, which is generally Lee et al. (2010a). In terms of computational speed, 
our method is faster by at least a factor of two compared to the more efficient of the two 



proposals, which is generally Witten et al. (2009). Thus we show that the current state of the 
art in sparse SVDs is "inadmissible" if measured by the two metrics 'statistical performance' 
and 'computational speed': our method closely matches the better statistical performance 
and provides it at a fraction of the better computational performance. In fact, by making 
use of sparsity at the initialization stage, our method also beats the conventional SVD in 
terms of speed. 

Lastly, our method is grounded in asymptotic theory that comprises minimax results 
which we describe in a companion paper (Yang et al. ( 2011[ )). A signature of this theory is 
that it is not concerned with optimization problems but with a class of iterative algorithms 
that form the basis of the methodology proposed here. As do most asymptotic theories in 
this area, ours relies heavily on Gaussianity of noise, which is the major aspect that needs 
robustification when turning theory into methodology with a claim to practical applicabil- 
ity. Essential aspects of our proposal therefore relate to lesser reliance on the Gaussian 
assumption. 

The present article is organized as follows. Section [2] describes our method for computing 
sparse SVDs. Section |3] shows simulation results to compare the performance of our method 



with that of Witten et al. (2009) and Lee et al. (2010a). Section |4| applies our and the 
competing methods to real data examples. Finally, Section |5] discusses the results and open 
problems. 
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2 Methodology 

In this section, we give a detailed description of the proposed sparse SVD method. 

To start, consider the noiseless case. Our sparse SVD procedure is motivated by the 



simultaneous orthogonal iteration algorithm (Golub and Van Loan 1996, Chapter 8), which 



is a straightforward generalization of the power method for computing higher-dimensional 
invariant subspaces of symmetric matrices. For an arbitrary rectangular matrix S of size nxp 
with SVD H = UDV, one can find the subspaces spanned by the first r (1 < r < min{n,p)) 
left and right singular vectors by iterating the pair of mappings V ^ U and U ^ V with S 
and S' (its transpose), respectively, each followed by orthnormalization, until convergence. 
More precisely, given a right starting frame V^'^\ that is, a p x r matrix with r orthonormal 
columns, the SVD subspace iterations repeat the following four steps until convergence: 



(1) Right-to-Left Multiplication: 


jj'{k),mul _ 




(2) Left Orthonormalization with QR Decomposition: 






(3) Left-to-Right Multiplication: 






(4) Right Orthonormalization with QR Decomposition: 




'^{k),mul 



(2) 



The superscript ^''^ indicates the fc'th iteration, and "'"^ the generally non-orthonormal inter- 
mediate result of multiplication. For r = 1, the QR decomposition step reduces to normal- 
ization. If S is symmetric, the second pair of steps is the same as the first pair, hence the 
original orthogonal iteration algorithm for symmetric matrices is a special case of the above 
algorithm. 

The problems our approach addresses are the following: For large noisy matrices in which 
the significant structure is concentrated in a small subset of the matrix X, the classical 
algorithm outlined above produces estimates with large variance due to the accumulation 



of noise from the majority of structureless cells (Shabalin and Nobel, 2010). In addition to 



the detriment for statistical estimation, involving large numbers of structureless cells in the 
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calculations adds unnecessary computational cost to the algorithm. Thus, shaving off cells 
with little apparent structure has the promise of both statistical and computational benefits. 
This is indeed borne out in the following proposal for a sparse SVD algorithm. 



2.1 The FIT-SSVD Algorithm: 

"Fast Iterative Thresholding for Sparse SVDs" 

Unsurprisingly, the algorithm to be proposed here involves some form of thresholding, be 
it soft or hard or something inbetween. All thresholding schemes reduce small coordinates 
in the singular vectors to zero, and additionally such schemes may or may not shrink large 
coordinates as well. Any thresholding reduces variance at the cost of some bias, but if 
the sparsity assumption is not too unrealistic, the variance reduction will vastly outweigh 
the bias inflation. The obvious places for inserting thresholding steps are right after the 
multiplication steps. If thresholding reduces a majority of entries to zero, the computational 
cost for the subsequent multiplication and QR decomposition steps is much reduced as well. 
The iterative procedure we propose is schematically laid out in Algorithm [T] 

In what follows we discuss the thresholding function and convergence criterion of Al- 
gorithm [T| Subsequently, in Sections 2^-2^, we describe other important aspects of the 



algorithm: the initialization of the orthonormal matrix, the target rank, and the adaptive 
choice of threshold levels. 

Thresholding function At each thresholding step, we perform entry-wise thresholding. 
In our modification of the subspace iterations ^ we allow any thresholding function r](x,'j) 
that satisfies |?7(x,7) — x| < 7 and ?7(x, 7)l|a;|<^ = 0, which includes soft-thresholding with 
Vsoft{x,'y) = sign(x)(|x| -7)+, hard-thresholding with 

Vhardi^i 'y) — •^l|a;|>75 Wcll aS the 



thresholding function used in SCAD (Fan and Li, 2001). The parameter 7 is called the 



threshold level. In Algorithm [T| we apply the same threshold level jui (or 7„/) to all the 
elements in the Ith column of ijW'^^^ (^or resp.). For more details on threshold 
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Input: 

1. Observed data matrix X. 

2. Target rank r. 

3. Thresholding function t]. 

4. Initial orthonormal matrix V^^^ G W^^. 

5. Algorithm / to calculate the threshold level 7 = f{X, U, V, a) given 

(a) the data matrix X, (b) current estimates of left and right singular vectors U, V, 
and (c) an estimate of the standard deviation of noise a. 
(Algorithm [3] is one choice.) 
Output: Estimators U = f/(°°) and V = V^°°\ 
1 Set (T = 1.4826 MAD (as. vector (X)). 
repeat 

Right-to-Left Multiplication: f/(fc)'™"' = XV^^'-^l 
Left Thresholding: f/C^)'*'^'' = (wf with u^'^'''''' = rj (wf 7«/) , 
where 7„ = f{X, U^^-'^\ V^^~'^\ a). 

Left Orthonormalization with QR Decomposition: U^^'^R^u^ = f/(*^)'*'*^. 
Left-to-Right Muhiplication: V^C^)."^"^ = X'U^''\ 

Right Thresholding: V^'^^'^^ = {v^i^'"''), with v^^^'""' = r] (t;]?^''""', 7.z) , 
where 7^ = /(X', V'^^'^\ U^''\a). 
Right Orthonormalization with QR Decomposition: 
until Convergence; 



Algorithm 1: FIT-SSVD 



levels, see Section 2.4 



Convergence criterion We stop the iterations once subsequent updates of the orthonor- 
mal matrices are very close to each other. In particular, for any matrix H with orthonormal 
columns (that is, H'H = I), let = HH' be the associated projection matrix. We stop 
after the kth. iteration if max{||P^(fc) — P^(fc-i)||2, H-Pyc^) ~ -^Vc^-i) II2} — ^! where e is a pre- 
specified tolerance level, chosen to be e = 10"^ for the rest of this article. [\\A\\2 denotes the 
spectral norm of A] 

2.2 Initialization algorithm for FIT-SSVD 

In Algorithm [l| we need a starting frame V^*^''-* such that the subspace it spans has no 
dimension that is orthogonal to the subspace spanned by the true V . Most often used is 



Input: 

1. Observed data matrix X. 

2. Target rank r. 

3. Degree of "Huberization" (3 (typically 0.95 or 0.99), 

that defines a quantile of the absolute values of entries in X. 

4. Significance level of a selection test a. 

Output: Orthornormal matrices U = U^^^ and V = V^^\ 

1 Subset selection: 

Let 6 be the /3-quantile of the absolute values of all the entries in X. 
Define Y = {yij) by i/ij = p{xij, 6), where p{x, 6) is the Huber p function: 

p{x,6) = if < b and 1b\x\ — otherwise. 
Select a subset / = {zi, •••} of rows according to the next four steps: 

- Let ti = Y7j=i Vij for i = 1, . . . , n. 

— Let p = median()f:i, t„) and s = 1.4826 MAD(ti, tn). 

- Calculate p-values: Pi = I - $(^), where $ is the CDF of A^(0, 1). 

— Perform the Holm method on the p-values at family-wise error rate a, 
and let / be the indices of the p-values that result in rejection. 

Select a subset of columns J similarly. 
Form the submatrix Xjj of size |/| x | J|. 

2 Reduced SVD: Compute r leading pairs of singular vectors of the submatrix Xjj. 
Denote them by u{, . . . , (|/| x 1 each) and vf , . . . , v/ (| J| x 1 each). 

3 Zero-padding: Create U^^^ = \ . . . , Ur°''] {n x r) and V^^^ = . . . , Vr°''] {p x r), 
such that = uj^ , uf,) = , = vj, , = 0. 



Algorithm 2: Initialization algorithm for FIT-SSVD. 
the V frame provided by the ordinary SVD. However, due to its denseness, computational 



cost and inconsistency (Shabalin and Nobel, 2010), it makes an inferior starting frame. 



Another popular choice is initialization with a random frame, which, however, is often nearly 
orthogonal to the true V and thus requires many iterations to accumulate sufficient power 
to converge. We propose therefore Algorithm [2] which overcomes these difficulties. 



The algorithm is motivated by Johnstone and Lu (2009) who obtained a consistent es 



timate for principal components under a sparsity assumption by initially reducing the di- 
mensionality. We adapt their scheme to the two-way case, and we weaken its reliance on 
the assumption of normal noise which in real data would result in too great a sensitivity 
to even slightly heavier tails than normal. To this end we make use of some devices from 
robust estimation. The intent is to perform a row- and column-preselection (Step [l| before 
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applying a classical SVD (Step [2]) so as to concentrate on a much smaller submatrix that 
contains much of the signal. We discuss the row selection process, column selection being 
analogous. 

Signal strength in rows would conventionally be measured under Gaussian assumptions 
with row sums of squares and tested with tests with p degrees of freedom. As mentioned 
this approach turns out to be much too sensitive when applied to real data matrices due to 
isolated large cells that may stem from heavier than normal tails. We therefore mute the 
influence of isolated large cells by Huberizing the squares before forming row sums. We then 
form approximate z-score test statistics, one per row, drawing on the CLT since we assume 
p (the number of entries in each row) to be large. Location and scale for the z-scores are 
estimated with the median and MAD ("median absolute deviation", instead of mean and 
standard deviation) of the row sums, the assumption being that over half of the rows are 
approximate "null rows" with little or no signal. If the signal is not sparse in terms of rows, 
this procedure will have low power, which is desirable because it biases the initialization of 
the iterative Algorithm [T] toward sparsity. Using robust 2;-score tests has two benefits over 

tests: they are robust to isolated large values, and they avoid the sensitivity of tests 
caused by their rigid coupling of expectation and variance. Finally, since n tests are being 
performed, we protect against over-detection due to multiple testing by applying Holm's 
(1979) stepwise testing procedure at a specified family-wise significance level a (default: 
5%). The end result are a set of indices I of "significant rows". — The same procedure is 
then applied to the columns, resulting in an index set J of "significant columns". 

The submatrix Xjj is then submitted to an initial reduced SVD. It is this initial reduction 
that allows the present algorithm to be faster than a conventional SVD of the full matrix 
X when the signal is sparse. The left and right singular vectors are of size |/| and |J|, 
respectively. To serve as initializations for the iterative Algorithm [T| they are expanded and 
zero-padded to length n and p, respectively (Step|3|. — This concludes the initialization 
Algorithm [2j 
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2.3 Rank estimation 



In Algorithm[TJ a required input is the presumed rank of the signal underlying X. In practice, 
we need to determine the rank based on the data. Proposals for rank estimation are the 



subject of a literature with a long history, of which we only cite Wold (1978), Gabriel (2002), 



and Hoff (2007). The proposal we chose is the bi-cross- validation (BCV) method by Owen 



and Perry (2009), but with a necessary twist. 



The original BCV method was proposed for low-rank matrices with dense singular vec- 
tors. Thus, we apply it to the submatrix Xjj obtained from the initialization Algorithm [2| 
instead of X itself. The submatrix should have much denser singular vectors and, even more 
importantly, much higher signal to noise ratio compared to the full matrix. In simulations 
not reported here but similar to those of Section [3} BCV on Xjj yielded consistent rank 
estimation when the signal was sufficiently strong for detection in relation to sparsity and 
signal-to-noise ratio. 



2.4 Threshold levels 

The tuning parameters 7 in the thresholding function 77(0;, 7) are called "threshold levels"; 
they play a key role in the procedure. At each thresholding step in Algorithm[T| a (potentially 
different) threshold level needs to be chosen for each column / = 1, ...,r of t/*^^^ and V"*^'^-' to 
strike an acceptable bias- variance tradeoff. In what follows, we focus on U^''\ while the case 
of V^''^ can be obtained by symmetry. 

The goal is to process the iterating left and right frames in such a way as to retain the 
coordinates with high signal and eliminate those with low signal. To be more specific, we 
focus on one column ^p)'™"' = Xvp~^\ Recall that X is assumed to admit an additive 
decomposition into a low-rank signal plus noise according to model ([T]). Then a theoretically 
sensible (though not actionable) threshold level for up)'*""' would be 7^; = E[||Zv^^^ ^''lloo], 
where Z is the additive noise matrix, and ||Zvp^^'' ||oo is the maximum absolute value of the 
n entries in the vector 2'v|^ ^\ The signal of any coordinate in u^^)'™"' with value less than 
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7ni could be regarded low since it is weaker than the expected maximum noise level in the 
rth rank given that there are n rows. 

The threshold 'jui as written above is of course not actionable because it involves knowl- 
edge of Z, but we can obtain information by leveraging the (presumably large) part of X 
that is estimated to have no or little signal. This can be done as follows: Let be the 
index set of rows which have all zero entries in U^''~^\ and let be its complement; define 
Ly and analogously. We may think of L„ and Ly as the current estimates of low signal 
rows and columns. Consider next a reordering and partitioning of the rows and columns of 
X according to these index sets: 



X 



(x X ^ 



(3) 



Since the entries in corresponding to are zero, only X-u^ (of size nx \ Hv\i containing 

the two left blocks in (|3])) is effectively used in the right-to- left multiplication of the iterative 
Algorithm [TJ We can therefore simulate a "near-null" situation in this block by filling it 
with random samples from the bottom right block which we may assume to have no or only 
low signal: Xi^^i^^ ^ 'Z'l^l^- Denote the result of such "bootstrap transfer" from X^^^^ 
to X-u^ by Z* in x Passing Z* through the right-to-left multiplication with v^^^ 

we form Z*^'^^J'\ which we interpret as an approximate draw from Zv^^'^"^''. We thus 
estimate HZvp""^-* ||oo with ||Z*v^Jj"'"^ ||oo, and E[||Zvp^"'"'' ||oo] with a median of ||Z*v^J^"^'' ||oo 
over multiple bootstraps of Z* . 

In order for this to be valid, the block X^^Xi, needs to be sufficiently large in relation to 
X;ii^- This is the general problem of the "m out of n" bootstrap, which was examined by 



Bickel et al. (1997). According to their results, this bootstrap is generally consistent as long 
as m = o(n). Hence, when the size |Lu||L„| of the matrix X^^^^ is large, say, larger than 
n\Hy\ \og{n\Hy\), we estimate E[||Zv['^ ^^lloo] by the median of M bootstrap replications for 
sufficiently large M. When the condition is violated, tends to be large, the central limit 
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Input: 

1. Observed data matrix X E W^^P; 

2. Previous estimators of sine; ular vectors U^''^ G V^^*^) G RP""""; 

3. Pre-specified number M of bootstraps; 

4. Estimate of the standard deviation of noise a. 
Output: Threshold level 7 G M''. 

1 Subset selection: L„ = {i : m-^^ = ... = u['^^ = 0}, L„ = {j : fjj^ = ... = fj^^ = 0}, 

Hu = L'^, = L^; 

2 if |L^||L„| < n\H^\log{n\H^,\) then 

r ; 



3 I return 7 = (3"a/2 log(?2)l G 
else 



for i ^ 1 to M do 

Sample n\Hy \ entries from Xl^l^ and reshape them into a matrix Z G M"^!-'^''!; 
B = Zl4^j G M"^''; 

Ci: = (||-B;l||oo) ||-B:2||oo) • • • ; || -B:r || oo)'; 

7; = median(C:z); 
return 7 = (71, ...,7^)'. 



Algorithm 3: The threshold level function f{X,U,V, a) for Algorithm [TJ As shown, 
the code produces thresholds for U. A call to f{X', V, U, a) produces thresholds for V. 

(k—l) 

theorem takes effect, and each element of Z\\ would be close to a normal random variable. 



Thus, the expected value of the maximum is near the asymptotic value a/2 logn times the 
standard deviation. — We have now fully defined the threshold 7^^ to be used on 
The thresholds for / = 1, ...,r are then collected in the threshold vector 7„ = {■jui, •••,7«r)'- 
A complete description of the scheme is given in Algorithm [3j Based on an extensive 
simulation study, setting the number of bootstrap replications to M = 100 yields a good 
balance between the accuracy of the threshold level estimates and computational cost. 



2.5 Alternative methods for selecting threshold levels 

In methods for sparse data, one of the most critical issues is selecting threshold levels wisely. 
Choosing thresholds too small kills off too few entries and retains too much variance, whereas 
choosing them too large kills off too many entries and introduces too much bias. To navigate 



this bias-variance trade-off, we adopted in Section 2.4 an approach that can be described as 
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a form of testing: we established max-thresholds that are unhkely to be exceeded by any 
U- or ^-coordinates under the null assumption of absent signal in the corresponding row or 
column of the data matrix. 

To navigate bias- variance trade-offs, other commonly used approaches include various 
forms of cross-validation, a version of which we adopted for the different problem of rank 



selection in Section 2.3 (bi-cross-validation or BCV according to Owen and Perry (2009)). 
Indeed, a version of cross-validation for threshold selection is used by one of the two compet- 



ing proposals with which we compare ours: Witten et al. (2009) leave out random subsets 



of the entries in the data matrix, measure the differences between the fitted values and the 
original values for those entries, and choose the threshold levels that minimize the differences. 
Alternatively one could use bi-cross-validation (BCV) for this purpose as well, by leaving 
out sets of rows and columns and choosing the thresholds that minimize the discrepancy be- 
tween the hold-out and the predicted values. However, this would be computationally slow 
for simultaneous minimization of two threshold parameters. Moreover, the possible values 
of the thresholds vary from zero to infinity, which makes it difficult to choose grid points for 



the parameters. In order to avoid such issues, Lee et al. (2010a) implement their algorithm 



by embedding the optimization of the choice of the threshold level inside the iterations that 
calculate u/ for fixed v; and v/ for fixed (unlike our methods, theirs fits one rank at a 
time). They minimize a BIG criterion over a grid of order statistics of current estimates. 
This idea could be applied to our simultaneous space-fitting approach, but the simulation 



results in Section [3j below show that the method of Lee et al. (2010a) is computationally 
very slow. 



3 Simulation results 



In this section, we show the results of numerical experiments to compare the performance 
of FIT-SSVD with two state-of-the-art sparse SVD methods from the literature (as well as 
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with the ordinary SVD). In contrast to FIT-SSVD which acquires whole subspaces spanned 
by sparse vectors simultaneously, both comparison methods are stepwise procedures that 
acquire sparse rank-one approximations (i/U/V; successively; for example, the second rank- 
one approximation ^2112^2 is found by applying the same method to the residual matrix 
X — diUi\[, and so on. For both methods it is therefore only necessary to describe how they 
obtain the first rank-one term. 



The first sparse SVD algorithm for comparison was proposed by Lee et al. (2010a) 
[referred to from here on by their initials, "LSHM"]. They obtain a first pair of sparse 
singular vectors by finding the solution to the following li penalized SVD problem 
under an I2 constraint: 



mm 

u,v,s 



|X - suv'll^ sAu||u||i sA„||v||i) . subject to ||u||2 = ||v||2 = 1. 
LSHM solve this problem by alternating between the following steps till convergence: 



(1) ui = VsoftiXvf', Xu) , ur'"^^ 

(2) ^i = 7]soft{x'ur'",Xv), vr"^ 



2 



|v;||2 



The second sparse SVD algorithm for comparison with our proposal is the adaptation 



of the penalized matrix decomposition scheme by Witten et al. (2009) to the sparse 
SVD case [referred to as "PMD-SVD" from here on]. They obtain the first pair of 
sparse singular vectors by imposing simultaneous li and I2 constraints on both vectors: 



min \\X — duv'll^ , subject to ||u||2 = ||v||2 = 1, ||u||i < s„, ||v||i < 
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The PMD-SVD algorithm iterates between the following two steps until convergence: 



(1) u = ^^"'^^j' ' , where 6^ is chosen by binary search such that ||u||i = Su - 

\\r]soft[Xv,du)\\2 

(2) V = II ^^"^^^^^ — Vth" ■> where 5y is chosen by binary search such that ||v||i = 

\\r]soft[X'vL,6^)\\2 



To make fair comparisons, we use the implementations by their original authors for both 



LSHM (Lee et al., 2010b) and PMD-SVD (Witten et al. 2010). The tuning parameters 



are always chosen automatically by the default methods in their implementations. For FIT- 
SSVD, we always use rj = rjhard in Algorithm[T} Huberization (3 = 0.95 and Holm family-wise 
error rate a = 0.05 in Algorithm [2| and M = 100 bootstraps in Algorithm [3j We did try 
different values of a, (5 and M in FIT-SSVD, and the results are not sensitive to these choices. 
Thus, in our experience there is no need for cross-validated selection of these parameters. 

In what follows, we report simulation results for situations in which the true underlying 
matrix has rank one and two, respectively. Throughout this section, the rank of the true 
underlying matrix is assumed known. 

3.1 Rank-one results 

In this part, we generate data matrices according to model ([T]) with rank r = 1, = 1024 and 
p = 2048, the singular value di ranging in {50, 100,200}, and iid noise Zij ~ (/i=0, (T^=1). 
At first glance di = 50 may appear like an outsized signal strength, but it actually is not: The 
expected sum of squares of noise is E[||Z|||,] = np ^ 2 million, whereas the sum of squares 
of signal is a comparably vanishing ||(iiUiv']^|||, = d\ = 2500, for a signal-to-noise ratio 
S/N = 0.0012 (which makes the failure of the ordinary SVD in these tasks less surprising). 
Even di = 200 amounts to a S/N = 0.012 only. 

As mentioned in the introduction, the FIT-SSVD method was motivated by theoretical 



results that were based on Gaussian assumptions (Yang et al. , 2011); it is therefore a par 



ticular concern to check the robustness of the method under noise with heavier tails than 
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Gaussian. To this end we report simulation results both for A^(0, 1) and \/S/5tr, noise, the 
latter also having unit variance (the purpose of the factor a/3/5). 

For the construction of meaningful singular vectors we use a functional data analysis 
context: We choose functions gleaned from the literature and represent them in wavelet bases 
where they feature realistic degrees of sparsity. In Figure [l| Plot (a) ( "peak" ) shows the 
graph of a function with three peaks, evaluated at 1024 equispaced locations, while Plot (b) 
("poly") shows a piecewise polynomial function, evaluated at 2048 equispaced locations. 
Both functions create dense evaluation vectors but sparse wavelet coefficient vectors. [In all 



simulation results reported below, we use Symmelet 8 wavelet coefficients (Mallat, 2009). 



Multi-resolution plots of the wavelet coefficients are shown in Plots (c) ("wc-peak") and (d) 
("wc-poly") of Figure [!} We choose ui and vi to be the wavelet coefficient vectors wc-peak 
and wc-poly, respectively. 



0.25 - 
0.2 - 

0.15 - 
0.1 ^ 

0.05 ■ 



(a) Three peak function 



0.2 0.4 0.6 0.8 

Location 

(c) DWT of 3-peaW function (Symmiet 8). 



0.08 
0.06 
0.04 
0.02 


-0.02 
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(b) Piecewise polynomiai function 




0.4 0.6 
Location 



(d) DWT of piecewise polynomiai function (Synnmiet 8). 



0.4 0.6 
Location 



0.4 0.6 
Location 



Figure 1: (a) peak: three-peak function evaluated at 1024 equispaced locations; (b) poly: 
piecewise polynomial function evaluated at 2048 equispaced locations; (c) wc-peak: discrete 
wavelet transform (DWT) of the three-peak function; (d) wc-poly: DWT of the piecewise 
polynomial function. In Plot (c) and (d), each vertical bar is proportional in length to the 
magnitude of the Symmiet 8 wavelet coefficient at the given location and resolution level. 



For each simulated scenarios, we ran 100 simulations, applied each algorithm under com- 
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losses 


di 


FIT-SSVD 


LSHM 


PMD-SVD 


SVD 






median MAD 


median MAD 


median MAD 


median MAD 


-^space (ui, Ul) 


50 
100 
200 


0.0513 0.0009 
0.0127 0.0003 
0.0036 0.0001 


0.0669 0.0014 
0.0159 0.0004 
0.0044 0.0001 


0.0783 0.0007 
0.0254 0.0002 
0.0102 0.0000 


0.5225 0.0034 
0.1114 0.0005 
0.0264 0.0001 


space ( Vl, Vi) 


50 
100 
200 


0.0958 0.0008 
0.0325 0.0004 
0.0112 0.0001 


0.1095 0.0016 
0.0385 0.0005 
0.0131 0.0002 


0.1399 0.0008 
0.0566 0.0003 
0.0241 0.0001 


0.6330 0.0025 
0.1878 0.0006 
0.0499 0.0001 


L(S,H) 


50 
100 
200 


0.1454 0.0014 
0.0457 0.0004 
0.0149 0.0001 


0.1726 0.0019 
0.0549 0.0007 
0.0177 0.0003 


0.3280 0.0016 
0.0973 0.0003 
0.0364 0.0001 


2.2217 0.0082 
0.3709 0.0009 
0.0805 0.0002 


l|ui||o 


50 
100 
200 


24 0.1483 
34 0.1483 
43 0.1483 


22 0.2965 
32 0.2965 
41 0.2965 


242.5 1.4085 
372.5 1.4085 
577.0 1.3343 


1024 
1024 
1024 


l|vi||o 


50 
100 
200 


18 0.2965 
40 0.2965 
66 0.4448 


14 0.2965 
38 0.4448 
64 0.6672 


535.0 2.2239 
854.5 1.7050 
1303.0 2.0756 


2048 
2048 
2048 


time 


50 
100 
200 


0.3364 0.0096 
0.4401 0.0209 
0.5685 0.0360 


36.7316 0.4497 
30.8268 0.3305 
23.7639 0.2542 


2.0578 0.0129 
1.9607 0.0124 
1.9274 0.0102 


1 
1 
1 



Table 1: Comparison of four methods in the rank-one case: Ui is wc-peak, Vi is wc-poly, 
and the noise is iid N(0,1). 

parison, and summarized the results in terms of median and MAD-based standard error. 
The criteria which we use for comparison of the methods are best explained with reference 
to Table [T| where we report the results for iid A^(0, 1) noise Z: 

• The first block examines the estimation accuracy of the left singular vector, with the 



three rows corresponding to three different values of di. Following Ma (2011), we 
define the loss function for estimating the column space of U for a general rank-r 
by Lspace{U,U) = \\Pu — PuWl ' where Pu = UU' is the projection matrix onto the 
subspace spanned by the columns of U (which is of size n x r and has orthonormal 
columns, U'U = Ir)- In the rank-one case here, the loss reduces to sin^ Z(ui,ui). 

The second block in Table [T] reports the loss for right singular vectors. 

The third block shows the scaled recovery error for the low-rank signal matrix S = 
UDV\ defined as L(S, S) = ||S - Here, H = UDV' and t) = diag{di, . . . , 4 
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with diagonal entries being di = u^Xv^. 

• The fourth and fifth panels of Table [T] show the sparsity of the solutions measured by 
||ui||o and ||vi||o, that is, the number of nonzero elements in the estimates. 

• The last block shows timing results as a fraction or multiple of the ordinary SVD. 
The results are as follows: 

• From the first three blocks we see that FIT-SSVD uniformly outperforms the other 
methods with respect to the three statistical criteria. While LSHM is not far behind 
FIT-SSVD, PMD-SVD lags in several cases by a factor of two or more. The ordinary 
SVD fails entirely for low signal strength as the results for di = 50 illustrate, impressing 
the need to leverage sparsity in such situations. Rather expectedly, all methods achieve 
better statistical accuracy as the signal strength di increases 

• As for the sparsity metrics, FIT-SSVD and LSHM produce similar levels of sparsity, 
while PMD-SVD estimators are much denser. The results also suggest that as the signal 
strength di gets stronger, the three sparse SVD methods estimate more coordinates. 

• Finally, the timing results indicate that FIT-SSVD is faster than all other methods, 
the ordinary SVD included. LSHM stands out as slower than FIT-SSVD by factors 
of over 40 to over 100. PMD-SVD is more competitive but still at least a factor of 
three slower than FIT-SSVD. The variation in time for PMD-SVD is small because 
the majority is spent in cross-validation. 

To examine the effect of heavy-tailed noise, we report in Table [2] the simulation results 
when the entries of the noise matrix Z are distributed iid a/3/5 ts, all else being the same 
as in Table 1 [Recall that the scaling factor a/3/5 is used to ensure unit variance.] The 
statistical performance for all methods is worse than in Table [1} In terms of the statistical 
metrics, the performances of FIT-SSVD and LSHM are in a statistical dead heat, whereas 
PMD-SVD trails behind by as much as a factor of two in the case of high signal strength, 
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losses 


di 


FIT-SSVD 


LSHM 


PMD-SVD 


SVD 






median MAD 


median MAD 


median MAD 


median MAD 


-^space (ui, Ul) 


50 
100 
200 


0.0802 0.0015 
0.0177 0.0003 
0.0048 0.0001 


0.0819 0.0017 
0.0180 0.0004 
0.0047 0.0001 


0.0907 0.0011 
0.0282 0.0003 
0.0108 0.0001 


0.5405 0.0037 
0.1115 0.0006 
0.0262 0.0002 


space ( Vl, Vl) 


50 
100 
200 


0.1193 0.0014 
0.0451 0.0005 
0.0145 0.0002 


0.1191 0.0018 
0.0415 0.0007 
0.0137 0.0002 


0.1560 0.0014 
0.0601 0.0003 
0.0249 0.0001 


0.6432 0.0039 
0.1870 0.0006 
0.0498 0.0002 


L(S,H) 


50 
100 
200 


0.1944 0.0024 
0.0625 0.0007 
0.0192 0.0002 


0.1937 0.0028 
0.0600 0.0009 
0.0187 0.0002 


0.3719 0.0028 
0.1041 0.0005 
0.0378 0.0001 


2.2624 0.0107 
0.3706 0.0011 
0.0805 0.0002 


l|ui||o 


50 
100 
200 


20 0.2965 
31 0.1483 
40 0.1483 


21.5 0.3706 
33.0 0.2965 
41.0 0.2965 


235.0 1.5567 
364.0 1.8532 
569.5 2.0015 


1024 
1024 
1024 


l|vi||o 


50 
100 
200 


13 0.1483 
31 0.2965 
56 0.2965 


14.0 0.2965 
38.5 0.5189 
64.5 0.6672 


526.0 2.0015 
841.5 2.1498 
1307.5 2.2980 


2048 
2048 
2048 


time 


50 
100 
200 


0.3714 0.0190 
0.8072 0.0652 
0.8238 0.0710 


41.0695 0.9339 
30.5216 0.3774 
23.0527 0.2554 


2.0826 0.0046 
2.0187 0.0039 
1.9520 0.0048 


1 
1 
1 



Table 2: Comparison of four methods in the rank-one case: Ui is wc-peak, Vi is wc-poly, 
and the noise is lid a/sTs^s- 



di = 200. Again, FIT-SSVD and LSHM have comparable sparsities, whereas PMD-SVD is 
much denser. In terms of computation time, again FIT-SSVD is uniformly fastest, followed 
by PMD-SVD which trails by factors of over two to over five, and LSHM being orders of 
magnitude slower (by factors of 28 to 110). 



3.2 Rank- two results 

We show next simulation results for data according to model ([T]) with r = 2, and again 
n = 1024 and p = 2048. The singular values (^1,^2) range among the pairs (100,50), 
(200,50), and (200,100). The singular vectors are ui = wc-peak, vi =wc-poly, U2 = 
wc-step, and V2 = wc-sing, the properties of the latter two vectors being shown in Figure [2] 
Table [3] reports the results from 100 repetitions when the noise is iid A^(0, 1). In terms 
of statistical metrics, FIT-SSVD always outperforms LSHM though not hugely. PMD-SVD 
does slightly better than FIT-SSVD for Lspace{U,U), but much worse for Lspace{V,V) and 
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(b) Single singularity function 



0.2 0.4 0.6 0.8 1 

Location 

(d) DWT of single singularity function (Symmlet 8). 



0.2 0.4 0.6 0.8 1 

Location 

Figure 2: (a) step: step function evaluated at 1024 equispaced locations, (b)sing: single sin- 
gularity function evaluated at 2048 equispaced locations, (c)wc-step: DWT of step function, 
(d)wc-sing: DWT of single singularity function. 

L(S,S). This is due to the special type of cross-validation used in the package PMA: the 
parameters Su, Sy are set to be proportional to each other after being scaled according to the 
dimensionality, y/n and ^^p, which essentially reduces the simultaneous cross-validation on 
two parameters to one. Therefore, PMD-SVD actually enforces the same level of sparsity on 
u and V. 

In terms of sparsity of the estimators, the fourth and fifth blocks show the cardinality 
of the joint support of the estimated singular vectors, which indicate that FIT-SSVD and 
LSHM are again about comparable, and PMD-SVD is much denser as in the rank-one case. 
[We do not compare the losses and the Iq norms for individual singular vectors because LSHM 
and PMD-SVD do not produce orthogonal singular vectors.] 

Finally, in terms of computation time, FIT-SSVD dominates again, and the differences 
become somewhat more pronounced than in the rank-one case. In the high signal scenario, 
di = 200 and ^2 = 100, FIT-SSVD gets a boost because by avoiding the costly bootstrap in 
Algorithm |3] because Condition [2] is satisfied and the much cheaper normal approximation 
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(a) Step function 
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(c) DWT of step function (Symmlet 8). 
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losses 




d2 


FIT-SSVD 


LSHM 


PMD-SVD 


SVD 






median 


MAD 


median 


MAD 


median 


MAD 


median 


MAD 




100 


50 


0.1163 


0.0010 


0.1413 


0.0021 


0.1022 


0.0009 


0.5315 


0.0037 


-^bpciccy^ ? ^ / 


200 


50 


0.1148 


0.0013 


0.1422 


0.0018 


0.1007 


0.0009 


0.5265 


0.0027 


200 


100 


0.0376 


0.0003 


0.0443 


0.0006 


0.0321 


0.0003 


0.1114 


0.0005 




100 


50 


0.0514 


0.0009 


0.0596 


0.0010 


0.1230 


0.0008 


0.6376 


0.0029 


/ J orrtn r-c ( V . V ) 


200 


50 


0.0506 


0.0009 


0.0601 


0.0011 


0.1259 


0.0006 


0.6293 


0.0023 


200 


100 


0.0144 


0.0002 


0.0172 


0.0003 


0.0538 


0.0002 


0.1870 


0.0005 




100 


50 


0.0691 


0.0006 


0.0825 


0.0007 


0.1403 


0.0004 


0.7439 


0.0017 




200 


50 


0.0234 


0.0001 


0.0285 


0.0002 


0.0529 


0.0001 


0.2070 


0.0005 




200 


100 


0.0228 


0.0001 


0.0261 


0.0002 


0.0483 


0.0001 


0.1387 


0.0003 




100 


50 


49 


0.2965 


45.0 


0.4448 


479 


1.1861 


1024 







200 


50 


56 


0.2965 


49.0 


0.2965 


649 


1.5567 


1024 







200 


100 


77 


0.2965 


73.5 


0.5189 


657 


1.3343 


1024 







100 


50 


54 


0.2965 


50.5 


0.3706 


1158.0 


2.2239 


2048 





\supp{yi) U supp{-V2)\ 


200 


50 


78 


0.2965 


74.5 


0.5189 


1486.5 


2.1498 


2048 







200 


100 


81 


0.4448 


82.5 


0.5930 


1623.0 


2.1498 


2048 







100 


50 


1.1675 


0.0829 


64.7840 


0.6037 


2.7991 


0.0141 


1 





time 


200 


50 


1.4572 


0.1011 


55.6839 


0.5436 


2.7018 


0.0142 


1 







200 


100 


0.8000 


0.0668 


54.2361 


0.2363 


2.6429 


0.0073 


1 






Table 3: Comparison of four methods for the rank-two case, with noise iid A^(0, 1). 



on Line [3] of Algorithm [3] can be used to compute the threshold level. Since LSHM repeats 
its scheme on the residual matrix to get the second layer of SVD, computation time doubles. 
As for PMD-SVD, since the time is mainly spent in cross-validation and the same penalty 
parameter is used for different ranks, the increase in time is not obvious. 



4 Real data examples 

All the methods mentioned above require sparse singular vectors (with most entries close to 
zero). One source of such data is two-way functional data whose row and column domains 
are both structured, for example, temporally or spatially, as when the data are time series 
collected at different locations in space. Two-way functional data are usually smooth as func- 
tions of the row and column domains. Thus, if we expand them in suitable basis functions. 



such as an orthonormal trigonometric basis, the coefficients should be sparse (Johnstone 
20TT|. 
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FIT-SSVD 


LSHM 


PMD-SVD 


SVD 







82 


48 


96 


96 


11^2 





86 


56 


7 


96 


l|vi 





66 


45 


75 


75 


I|V2 





70 


45 


43 


75 



Table 4: Mortality data: number of nonzero coordinates in the transformed domain for four 
methods. 

4.1 Mortality rate data 

As our first example we use the US mortality rate data from the Berkeley Human Mortality 
Database (http://www.mortality.org/). They contain mortality rates in the United States 
for ages to 110 from 1933 to 2007. The data for people older than 95 was discarded 
because of their noisy nature. The matrix X is of size 96 x 75, each row corresponding to 
an age group and each column to a one-year period. We first pre- and post- multiply the 
data matrix with orthogonal matrices whose columns are the eigenvectors of second order 
difference matrices of proper sizes; the result is a matrix of coefficients of the same size as 



X. The rank of the signal is estimated to be 2 using bi-crossvalidation (Section 2.3). We 
then applied FIT-SSVD, LSHM, PMD-SVD and ordinary SVD to get the first two pairs of 
singular vectors. Finally, we transformed the sparse estimators of the singular vectors back 
to the original basis to get smooth singular vectors. 

The estimated number of nonzero elements in each singular vector (before the back trans- 
formation) is summarized in Table |4j none gives very sparse solutions. This is reasonable, 
because the mortality rate data is of low noise and for data with no noise we should just 
use the ordinary SVD. Because this data is of small size, it only takes a few seconds for all 
the algorithms. The plot of singular vectors for all the methods are shown in Figures [3] to [7j 
The red dashed line in the left plot is for FIT-SSVD, in the middle for LSHM, and on the 
right for PMD-SVD. We use the wider gray curve for the ordinary SVD as a reference. 

Figure [3] shows the first left singular vector plotted against age. The curve ui shows a 
pattern for mortality as a function of age: a sharp drop between age and 2, then a gradual 
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decrease till the teen years, flat till the 30s, after which begins an exponential increase. 
Figure |4] zooms into the lower left corner of Figure [3] to show the details between age and 
10. LSHM, as always turns out to be the sparsest (or smoothest) among the three iterative 
procedures in the transformed (or original) domain. We believe that FIT-SSVD and PMD- 
SVD make more sense based on a parallel coordinates plot of the raw data (not shown here), 
in which the drop in the early age appears to be sharp and therefore should not be smoothed 
out. Figure [5] shows the first right singular vectors plotted against year. It implies that 
mortality decreases with time. All of the panels show a wiggly structure, with LSHM again 
being the smoothest. Here, too, we believe that the zigzag structure is real and not due to 
noise in the raw data, based again on a parallel coordinate plot of the raw data. The zigzags 
may well be systematic artifacts, but they are unlikely to be noise. 



(a) mortality rate: FIT-SSVD (b) mortality rate: LSHM (c) mortality rate: PMD-SVD 




20 40 60 80 20 40 60 80 20 40 60 80 

age age age 

Figure 3: Mortality data: plot of Ui. Panel (a): FIT-SSVD vs. SVD; Panel (b): LSHM vs. 
SVD; Panel (c): PMD-SVD vs. SVD. 

The second pair of singular vectors is shown in Figures |6] and [7j They correct the pattern 
that the first pair of singular vectors does not capture. The contrast mainly focuses on 
people younger than 2 or between 60 and 90 where U2 is positive. Also, V2 has extreme 
negative or positive values towards the both ends, 1940s and 2000s. Together, they suggest 
that babies and older people had lower mortality rates in the 1940s and higher mortality 
rates in the 2000s than what the first component expresses. One final aspect to note is the 
strange behavior of U2,pmd-svd, recalling that ui,pmd-svd,<^i,paid^svd,<^2,pmd-svd all 
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(a) mortality rate: FIT-SSVD 



(b) mortality rate: LSHI\/I 



(c) mortality rate: PIUID-SVD 




Figure 4: Mortality data: Plot of ui. Zoom of the lower left corner of Figure [3] Everything 
else is the same as in Figure [3] 



(a) mortality rate: FIT-SSVD 



(b) mortality rate: LSHM 



(c) mortality rate: PMD-SVD 
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Figure 5: Mortality data: Plot of vi. Everything else is the same as in Figure |3| 
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follow the ordinary SVD very closely. We think this is again due to the cross-validation 
technique they use. 

(b) mortality rate: LSHM 



(a) mortality rate: FIT-SSVD 




(c) mortality rate: PMD-SVD 
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Figure 6: Mortality data: plot of U2. Everything else is the same as Figure [31 



(a) mortality rate: FIT-SSVD 



(b) mortality rate: LSHM 



(c) mortality rate: PMD-SVD 
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Figure 7: Mortality data: plot of V2. Everything else is the same as Figure [3j 



Huang et al. (2009) also used the mortality data from 1959 to 1999 to illustrate their 
version of regularized SVDs to get smooth singular vectors by adding second order difference 
penalties. If we compare the results shown in this section with theirs, our solutions lack the 
smoothness of their solutions, but we think we recover more information from the data by 
capturing not only the general trend but also local details such as year-to-year fluctuations. 



4.2 Cancer data 

We consider next another data example where some sparse structure may be expected to 



exist naturally. The cancer data used by Lee et al. (2010a) (who in turn have them from 
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Liu et al. (2008)) consists of the gene expression levels of 12,625 genes for 56 cases of four 



different types of cancer. It is believed that only a part of the genes regulate the types and 
hence the singular vectors corresponding to the genes should ideally be sparse. We apply 
the four SVD methods directly to the raw data without change of basis. 

Before we proceed it may be proper to discuss briefly some modeling ambiguities posed 
by this dataset as it is not a priori clear whether a PCA or SVD model is more appropriate. 
It might be argued that the cases really should be considered as being sampled from a pop- 
ulation, hence PCA would be the proper analysis, with the genes representing the variables. 
The counter argument is, however, that the cases are stratified, and the strata are pure 
convenience samples of sizes that bear no relation to the sizes of naturally occurring cancer 
populations. A dual interpretation with genes as samples and CcLSGS cLS variables would be 
conceivable also, but it seems even more far fetched in the absence of any sampling aspect 
with regard to genes. In light of the problems raised by any sampling assumption, it would 
seem more appropriate to condition on the cases and the genes and adopt a fixed effects 
view of the data. As a result the SVD model seems less problematic than either of the dual 
PCA models. 

We first attempted to estimate the rank of the signal using bi-crossvalidation (Sec- 
tion 2.3), but it turns out that the rank is sensitive to the choice of a (Holm family-wise 



error) and /3 (Huberization quantile) in Algorithm |2| ranging from r=3 to r=5. We decided 
to use r = 3 because this is the number of contrasts required to cluster the cases into four 



groups. Also, this is the rank used by Lee et al. (2010a), which grants comparison of their 
and our results. 

On a different note, running LSHM on these data with rank three took a couple of hours, 
which may be a disincentive for users to seek even higher ranks. The hours of run time of 
LSHM compares with a few minutes for PMD-SVD and merely a few seconds for FIT-SSVD. 
(In addition, LSHM's third singular vectors do not seem to converge within 300 iterations.) 

Table [5] summarizes the cardinalities of the union of supports of three singular vectors 
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FIT-SSVD 


LSHM 


PMD-SVD 


SVD 


1 Uf^i supp(uO| 


4688 


4545 


12625 


12625 


1 Uf=i supp(vi)| 


56 


56 


54 


56 



Table 5: Cancer data: summary of cardinality of joint support of three singular vectors for 
four methods. 

for each method. For the estimation of left singular vectors corresponding to different genes, 
the PMD-SVD solution is undesirably dense, while FIT-SSVD and LSHM give similar levels 
of sparsity. For the estimation of right singular vectors corresponding to the cases, we would 
expect that all cases have their own effects rather than zero, so it is not surprising that the 
estimated singular vectors are dense. 

Figure [8] shows the scatterplots of the entries of the first three right singular vectors for the 
four methods. Points represent patients, each row represents one method, and each column 
corresponds to two of the three singular vectors. The four known groups of patients are 
easily discerned in the plots. A curiosity is the cross-wise structure produced by PMD-SVD, 
where the singular vectors are nearly mutually exclusive: if one coordinate in a singular 
vector is non-zero, most corresponding coordinates in the other singular vectors are zero. 
The other three methods, including the ordinary SVD, agree strongly among each other in 
the placement of the patients. The agreement with the ordinary SVD is not a surprise as 
j9 = 56 is a relatively small column dimension on which sparsity may play a less critical role 
compared to the row dimension with n = 12625. Yet, the three sparse methods give clearer 
evidence that the carcinoid group (black cirlces) falls into two subgroups than the ordinary 
SVD. According to FIT-SSVD and LSHM the separation is along V3 (center and right hand 
plots), whereas according to PMD-SVD it is by lineup with vl and v2, respectively (left 
hand plot). 

Figure |9] shows checkerboard plots of the reconstructed rank-three approximations, layed 
out with patients on the vertical axis and genes on the horizontal axis. Each row of plots 
represents one method, and the plots in a given row show the same reconstructed matrix 
but successively ordered according to the coordinates of the estimated left singular vectors 
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Figure 8: Cancer data: Scatterplots of the entries of the first three right singular vectors 
V;,/ = 1,2,3 for four methods. Points represent patients. Black circle: Carcinoid; Red 
triangle: Colon; Green cross: Normal; Blue diamond: SmallCell. 
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Ui, U2 and U3. There are fewer than 5000 genes shown for FIT-SSVD and LSHM, because 
the rest are estimated to be zero, whereas all 12,625 genes are shown for PMD-SVD and 
SVD (Table [5]). We can see clear checkerboard structure in some of the plots, indicating 
biclustering. In spite of the strong similarity between the patient projections for FIT-SSVD 
and LSHM in Figure [8} there is a clear difference between these methods in the recon- 
structions in Figure [9] The FIT-SSVD solution exhibits the strongest block structure in its 
U2-based sort (center plot, top row), implying the strongest evidence of clustering among its 
non-thresholded genes. Since these blocks consist of many hundreds of genes, it would sur- 
prisingly suggest that the differences between the four patient groups run into the hundreds, 
not dozens, of genes. 

In spite of the differences in checkerboard patterns in Figure [9} the three left singular 
vectors are highly correlated between FIT-SSVD and LSHM: corr = 0.985, 0.981, and 0.968, 
respectively, and the top 20 genes with largest magnitude in the estimated three left singular 
vectors of FIT-SSVD overlap with those of LSHM except for one gene in the second singular 
vector. These shared performance aspects notwithstanding, the two methods differ hugely 
in computing time, FIT-SSVD taking seconds, LSHM taking a couple of hours. 

5 Discussion 

We presented a procedure, called FIT-SSVD, for the fast and simultaneous extraction of 
singular vectors that are sparse in both the row and the column dimension. While the 
procedure is state of the art in terms of statistical performance, its overriding advantage 
is sheer speed. The reasons why speed matters are several: (1) Faster algorithms enable 
the processing of larger datasets. (2) The use of SVDs in data analysis is most often for 
exploratory ends which call for unlimited iterations of quickly improvised steps — something 
that is harder to achieve as datasets grow larger. (3) Sparse multivariate technology is still 
a novelty and hence at an experimental stage; if its implementation is fast, early adopters 



30 



FIT-SSVD: ordered by fit-ssvd FIT-SSVD: ordered by U2piT_sgvD FIT-SSVD: ordered by Ujpu.gsvo 




1000 2000 3000 4000 1000 2000 3000 4000 1000 2000 3000 4000 

genes genes genes 



LSHM: ordered by lshm LSHM: ordered by U2 lshm LSHM: ordered by U3 lshh 




1000 2000 3000 4000 1000 2000 3000 4000 1000 2000 3000 4000 

genes genes genes 



PMD-SVD: ordered by pmd-svd PMD-SVD: ordered by U2 pmd svd PMD-SVD: ordered by U3 pyo_svD 




2000 BOOO 10000 2000 6000 10000 2000 6000 10000 

genes genes genes 



SVD: ordered by u I svD SVD: ordered by U2 svd SVD: ordered by U3 svd 




2000 6000 10000 2000 6000 10000 2000 6000 10000 

genes genes genes 



Figure 9: Cancer data: Image plots of the rank-three approximations Yli=i 2 3 diui^'i whose 
values are gray-coded. Each image is laid out clS CclSGS (= rows) by genes (= columns). The 
same rank-three approximation is shown three times for each method (left to right), each time 
sorted according to a different u; {I = 1, 2, 3). (The mapping of the rank-three approximation 
values to gray scales is by way of a rank transformation, using a separate transformation for 
each image. Rank transformations create essentially uniform distributions that better cover 
the range of gray scale values.) 
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of the technology have a better chance to rapidly gain experience by experimenting with its 
parameters. (4) If a statistical method such as sparse SVD has a fast implementation, it 
can be incorporated as a building block in larger methodologies, for example, in processing 
data arrays that are more than two-dimensional. For these reasons we believe that fast SVD 
technology is of the essence for its success. 

A unique opportunity for sparse approaches is to achieve faster speed than standard non- 
sparse approaches when the structure in the data is truly sparse. Our algorithm achieves 
this to some extent through initialization that is both sparse and smart: sparse initializa- 
tion consists of a standard SVD of smaller size than the full data matrix, while smart (in 
particular: non-random) initialization reduces the number of iterations to convergence. A 
statistical benefit is that inconsistent estimation by the standard SVD on large data matrices 
with weak signal is avoided. — An imperative for fast implementations is avoiding where 
possible such slow devices as cross-validation. A considerable speed advantage we achieve 
is through relatively fast (non-crossvahdated) selection of thresholding levels based on an 
analytical understanding of their function. 

Our proposal has conceptual and theoretical features that are unique at this stage of 
the development of the field: (1) FIT-SSVD extracts r orthogonal left- and right-singular 
vectors simultaneously, which puts it more in line with standard linear dimension reduction 
where orthogonal data projections are the norm. In addition, simultaneous extraction can 
be cast as subspace extraction, which provides a degree of immunity to non-identifiability 
and slow convergence of individual singular vectors when some of the first r underlying sin- 
gular values are nearly tied: since we measure convergence in terms of distance between 
successive r-dimensional subspaces, our algorithm does not need to waste effort in pinning 
down ill-determined singular vectors as long as the left- and right-singular subspaces are 
well-determined. Such a holistic view of the rank-r approximation is only available to si- 
multaneous but not to successive extraction. (2) FIT-SSVD is derived from asymptotic 
theory that preceded its realization as a methodology: For Gaussian noise in the model 
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([T]), we (Yang et al. , 2011) showed that our algorithm with appropriately chosen parameters 
achieves the rate of the minimax lower bound. In other words, in a specific parameter space, 
our algorithm is asymptotically optimal in terms of minimizing the maximum possible risk 
over the whole parameter space. 

As for future work, the current state of the art raises several questions. For one, it would 
be of interest to better understand the relative merits of the currently proposed sparse 
SVD approaches since they have essential features in common, such as power iterations 
and thresholding. Another natural question arises from the fact that sparse SVDs build 
on the sequence model: many methods for choosing parameters from the data have been 
shown to be asymptotically equivalent to first order in the sequence model (see, e.g., Haerdle 
et al. (1988)), including cross-validation, generalized cross-validation. Rice's method based 
on unbiased estimates of risk, final prediction error, and the Akaike information criterion. 
Do these asymptotic equivalences hold in the matrix setting for sparse SVD approaches? 
How does the choice of the BIG in LSHM compare? Also, our algorithm and underlying 
theory allow a wide range of thresholding functions: Is there an optimal choice in some 
sense? Further, there exists still a partial disconnect between asymptotic theory and practical 
methodology: The theory requires a strict rank r model, whereas by all empirical evidence the 
algorithm works well in a "trailing rank" situation where real but small singular values exist. 
Finally, there is a robustness aspect that is specific to sparse SVD approaches: heavier than 
normal tails in the noise distribution generate "random factors" caused by single outlying 
cells. While we think we have made reasonable and empirically successful choices in drawing 
from the toolkit of robustness, we have not provided a theoretical framework to justify them. 
— Just the same, even if the proposed FIT-SSVD algorithm may be subject to some future 
tweaking, in the substance it has the promise of lasting merit. 
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